Reporte_Hito

Introduction

We decided to work on a dataset about terrorism because we believe that nowdays it’s a concerning topic for many people in many countries. We encountered a huge dataset published and updated by the University of Maryland. Here there is a little glimpse of how it’s done:

library(ggplot2)
library(raster)

## Loading required package: sp

library(readr)
setwd("C:/Users/felip/Desktop/Análisis de Datos")
globalterrorism <- read_csv("globalterrorismdb_0718dist.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   approxdate = col_character(),
##   resolution = col_character(),
##   country_txt = col_character(),
##   region_txt = col_character(),
##   provstate = col_character(),
##   city = col_character(),
##   location = col_character(),
##   summary = col_character(),
##   alternative_txt = col_character(),
##   attacktype1_txt = col_character(),
##   attacktype2_txt = col_character(),
##   attacktype3 = col_logical(),
##   attacktype3_txt = col_logical(),
##   targtype1_txt = col_character(),
##   targsubtype1_txt = col_character(),
##   corp1 = col_character(),
##   target1 = col_character(),
##   natlty1_txt = col_character(),
##   targtype2_txt = col_character(),
##   targsubtype2_txt = col_character()
##   # ... with 48 more columns
## )

## See spec(...) for full column specifications.

## Warning: 7113 parsing failures.
##  row             col           expected                                                                                                                                                                                     actual                             file
## 1687 ransomamtus     1/0/T/F/TRUE/FALSE 20000                                                                                                                                                                                      'globalterrorismdb_0718dist.csv'
## 1687 ransomnote      1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender.  However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3     1/0/T/F/TRUE/FALSE 2                                                                                                                                                                                          'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault                                                                                                                                                                              'globalterrorismdb_0718dist.csv'
## 3432 ransomnote      1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands.                                                                                            'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.

head(globalterrorism)

## # A tibble: 6 x 135
##   eventid iyear imonth  iday approxdate extended resolution country
##     <dbl> <dbl>  <dbl> <dbl> <chr>         <dbl> <chr>        <dbl>
## 1 1.97e11  1970      7     2 <NA>              0 <NA>            58
## 2 1.97e11  1970      0     0 <NA>              0 <NA>           130
## 3 1.97e11  1970      1     0 <NA>              0 <NA>           160
## 4 1.97e11  1970      1     0 <NA>              0 <NA>            78
## 5 1.97e11  1970      1     0 <NA>              0 <NA>           101
## 6 1.97e11  1970      1     1 <NA>              0 <NA>           217
## # ... with 127 more variables: country_txt <chr>, region <dbl>,
## #   region_txt <chr>, provstate <chr>, city <chr>, latitude <dbl>,
## #   longitude <dbl>, specificity <dbl>, vicinity <dbl>, location <chr>,
## #   summary <chr>, crit1 <dbl>, crit2 <dbl>, crit3 <dbl>, doubtterr <dbl>,
## #   alternative <dbl>, alternative_txt <chr>, multiple <dbl>,
## #   success <dbl>, suicide <dbl>, attacktype1 <dbl>,
## #   attacktype1_txt <chr>, attacktype2 <dbl>, attacktype2_txt <chr>,
## #   attacktype3 <lgl>, attacktype3_txt <lgl>, targtype1 <dbl>,
## #   targtype1_txt <chr>, targsubtype1 <dbl>, targsubtype1_txt <chr>,
## #   corp1 <chr>, target1 <chr>, natlty1 <dbl>, natlty1_txt <chr>,
## #   targtype2 <dbl>, targtype2_txt <chr>, targsubtype2 <dbl>,
## #   targsubtype2_txt <chr>, corp2 <chr>, target2 <chr>, natlty2 <dbl>,
## #   natlty2_txt <chr>, targtype3 <dbl>, targtype3_txt <chr>,
## #   targsubtype3 <dbl>, targsubtype3_txt <chr>, corp3 <chr>,
## #   target3 <chr>, natlty3 <dbl>, natlty3_txt <chr>, gname <chr>,
## #   gsubname <chr>, gname2 <chr>, gsubname2 <lgl>, gname3 <lgl>,
## #   gsubname3 <lgl>, motive <chr>, guncertain1 <dbl>, guncertain2 <dbl>,
## #   guncertain3 <lgl>, individual <dbl>, nperps <dbl>, nperpcap <dbl>,
## #   claimed <dbl>, claimmode <dbl>, claimmode_txt <chr>, claim2 <dbl>,
## #   claimmode2 <lgl>, claimmode2_txt <lgl>, claim3 <lgl>,
## #   claimmode3 <lgl>, claimmode3_txt <lgl>, compclaim <lgl>,
## #   weaptype1 <dbl>, weaptype1_txt <chr>, weapsubtype1 <dbl>,
## #   weapsubtype1_txt <chr>, weaptype2 <dbl>, weaptype2_txt <chr>,
## #   weapsubtype2 <dbl>, weapsubtype2_txt <chr>, weaptype3 <dbl>,
## #   weaptype3_txt <chr>, weapsubtype3 <dbl>, weapsubtype3_txt <chr>,
## #   weaptype4 <lgl>, weaptype4_txt <lgl>, weapsubtype4 <lgl>,
## #   weapsubtype4_txt <lgl>, weapdetail <chr>, nkill <dbl>, nkillus <dbl>,
## #   nkillter <dbl>, nwound <dbl>, nwoundus <dbl>, nwoundte <dbl>,
## #   property <dbl>, propextent <dbl>, propextent_txt <chr>,
## #   propvalue <dbl>, ...

dim(globalterrorism)

## [1] 181691    135

colnames(globalterrorism)

##   [1] "eventid"            "iyear"              "imonth"            
##   [4] "iday"               "approxdate"         "extended"          
##   [7] "resolution"         "country"            "country_txt"       
##  [10] "region"             "region_txt"         "provstate"         
##  [13] "city"               "latitude"           "longitude"         
##  [16] "specificity"        "vicinity"           "location"          
##  [19] "summary"            "crit1"              "crit2"             
##  [22] "crit3"              "doubtterr"          "alternative"       
##  [25] "alternative_txt"    "multiple"           "success"           
##  [28] "suicide"            "attacktype1"        "attacktype1_txt"   
##  [31] "attacktype2"        "attacktype2_txt"    "attacktype3"       
##  [34] "attacktype3_txt"    "targtype1"          "targtype1_txt"     
##  [37] "targsubtype1"       "targsubtype1_txt"   "corp1"             
##  [40] "target1"            "natlty1"            "natlty1_txt"       
##  [43] "targtype2"          "targtype2_txt"      "targsubtype2"      
##  [46] "targsubtype2_txt"   "corp2"              "target2"           
##  [49] "natlty2"            "natlty2_txt"        "targtype3"         
##  [52] "targtype3_txt"      "targsubtype3"       "targsubtype3_txt"  
##  [55] "corp3"              "target3"            "natlty3"           
##  [58] "natlty3_txt"        "gname"              "gsubname"          
##  [61] "gname2"             "gsubname2"          "gname3"            
##  [64] "gsubname3"          "motive"             "guncertain1"       
##  [67] "guncertain2"        "guncertain3"        "individual"        
##  [70] "nperps"             "nperpcap"           "claimed"           
##  [73] "claimmode"          "claimmode_txt"      "claim2"            
##  [76] "claimmode2"         "claimmode2_txt"     "claim3"            
##  [79] "claimmode3"         "claimmode3_txt"     "compclaim"         
##  [82] "weaptype1"          "weaptype1_txt"      "weapsubtype1"      
##  [85] "weapsubtype1_txt"   "weaptype2"          "weaptype2_txt"     
##  [88] "weapsubtype2"       "weapsubtype2_txt"   "weaptype3"         
##  [91] "weaptype3_txt"      "weapsubtype3"       "weapsubtype3_txt"  
##  [94] "weaptype4"          "weaptype4_txt"      "weapsubtype4"      
##  [97] "weapsubtype4_txt"   "weapdetail"         "nkill"             
## [100] "nkillus"            "nkillter"           "nwound"            
## [103] "nwoundus"           "nwoundte"           "property"          
## [106] "propextent"         "propextent_txt"     "propvalue"         
## [109] "propcomment"        "ishostkid"          "nhostkid"          
## [112] "nhostkidus"         "nhours"             "ndays"             
## [115] "divert"             "kidhijcountry"      "ransom"            
## [118] "ransomamt"          "ransomamtus"        "ransompaid"        
## [121] "ransompaidus"       "ransomnote"         "hostkidoutcome"    
## [124] "hostkidoutcome_txt" "nreleased"          "addnotes"          
## [127] "scite1"             "scite2"             "scite3"            
## [130] "dbsource"           "INT_LOG"            "INT_IDEO"          
## [133] "INT_MISC"           "INT_ANY"            "related"

As we can see the dataset is pretty large both considering the rows and the columns. Looking more deeply at the columns it’s clear that there are a lot of numerical and categorical features. This is would be particularly useful in the framework of the classification since it’s possible to use as a “labels” the country, the region, the year, the target of the attack and many other. Particularly interesting would be also performing clustering over this dataset as we could discover hidden relationships between the data. ThereÂ´s also a big oportunity to implement and compare different types of dimensionality reduction techniques. Moreover one thing that could be useful and also interesting could be making prediction in order to know where and when could happen something having certain characteristics.

Coming now to the objectives of the analysis our main questions are: - in which countries is justified the fear for terrorism? - is it possible to recognize some paths characterizing some regions/states/years? - is it possible to predict some features of the terrorism events in the folowwing year?

Geographical data visualization

The first thing we wanted to do was to visualize the trend for each year concerning the number of events occurring:

yearly <- rep(0, length(unique(globalterrorism$iyear)))

for( i in 1:length(unique(globalterrorism$iyear))){
  
  yearly[i] <- nrow(globalterrorism[which(globalterrorism$iyear==unique(globalterrorism$iyear)[i]),])
  
}

yearly <- data.frame(cbind(unique(globalterrorism$iyear),yearly))
colnames(yearly) <- c("year","count")

ggplot(yearly,aes(x=year,y=count)) + geom_point(size=3.5) + geom_line(size=1)

It’s clear that as the years pass by the trend of the events/yearyear has always been increasing (speceally in the last decade). Let’s see now if also the total number of deaths/year has increased:

yearly_deaths <- aggregate(nkill ~ iyear, globalterrorism, FUN=sum)

ggplot(yearly_deaths, aes(x=iyear,y=nkill)) + geom_point(size=3.5) + geom_line(size=1)

What immediatly catched our attention is the different scale for the events and deaths. This shows us that the number of events is generally greater than the number of deaths. We will have to pay attention to this because the concept of “terroristic attack” could embrace also very small events. By the way we see that also the trend for the deaths is increasing year after year.

Next we focused on what happen considering the total number of events for each country:

stately <- rep(0, length(unique(globalterrorism$country_txt)))

for( i in 1:length(unique(globalterrorism$country_txt))){
  
  stately[i] <- nrow(globalterrorism[which(globalterrorism$country_txt==unique(globalterrorism$country_txt)[i]),])
  
}

stately <- data.frame(cbind(unique(globalterrorism$country_txt),stately))
colnames(stately) <- c("country_txt","count")

stately$count <- as.numeric(as.character(stately$count))

stately_ord <- stately[order(stately$count,decreasing=TRUE),]

stately_ord_15 <- stately_ord[1:15,]

ggplot(stately_ord_15,aes(x=reorder(country_txt,count),y=count)) + 
  geom_bar(stat = "identity") + 
  xlab("Country") +
  coord_flip()

The natuaral prosecution of the last question is to consider the total number of deaths for each country:

state_deaths <- aggregate(nkill ~ country_txt, globalterrorism, FUN=sum)
state_deaths_ord <- state_deaths[order(state_deaths$nkill,decreasing = TRUE),]
state_deaths_15 <- state_deaths_ord[1:15,]

ggplot(state_deaths_15,aes(x=reorder(country_txt,nkill),y=nkill)) + 
  geom_bar(stat="identity") + 
  xlab("Country") +
  ylab("Deaths") +
  coord_flip()

A noteworthy feature of the 2 plots we just showed is that for some countries a high number of events doesn’t necessary mean a high number of killed people. For this reason we decided to compute the ratio deaths/events for every country:

ratio <- merge(stately,state_deaths,by = "country_txt")
ratio <- data.frame(country_txt = ratio$country_txt, r = ratio$nkill/ratio$count)

ratio_ord <- ratio[order(ratio$r,decreasing = TRUE),]
ratio_15 <- ratio_ord[1:15,]

ggplot(ratio_15,aes(x=reorder(country_txt,r),y=r)) + 
  geom_bar(stat="identity") + 
  xlab("Country") + 
  ylab("deaths/events") +
  coord_flip()

state_deaths_events <- merge(stately,state_deaths,by = "country_txt")

ggplot(state_deaths_events[which(state_deaths_events$count<10000),]) + 
  geom_point(aes(x=count,y=nkill)) + 
  xlab("number of events") + 
  ylab("deaths") + 
  geom_abline(intercept = 0, slope = 1, col = "red")

This plots confirmed what we suspected: there are some countries in which there are less events but more severe. In the scatter-plot is easy to see that there are a lot of data lying in the area with more deaths than events. This is suggesting us to be careful in our future analysis hypothesis and conclusions.

An other significant piece of info is the presence of many african countries in the last three plots.

Similarly to the analysis just presented we made something similar considering now macro-regions:

regionly <- rep(0, length(unique(globalterrorism$region_txt)))

for( i in 1:length(unique(globalterrorism$region_txt))){
  
  regionly[i] <- nrow(globalterrorism[which(globalterrorism$region_txt==unique(globalterrorism$region_txt)[i]),])
  
}

regionly <- data.frame(cbind(unique(globalterrorism$region_txt),regionly))
colnames(regionly) <- c("region_txt","count")

regionly$count <- as.numeric(as.character(regionly$count))

ggplot(regionly,aes(x=reorder(region_txt,count),y=count)) + 
  geom_bar(stat = "identity") + 
  xlab("Country") + 
  coord_flip()

region_deaths <- aggregate(nkill ~ region_txt, globalterrorism, FUN=sum)

ggplot(region_deaths,aes(x=reorder(region_txt,nkill),y=nkill)) + 
  geom_bar(stat="identity") + 
  xlab("Country") +
  ylab("Deaths") +
  coord_flip()

ratio_reg <- merge(regionly,region_deaths,by = "region_txt")
ratio_reg <- data.frame(region_txt = ratio_reg$region_txt, r = ratio_reg$nkill/ratio_reg$count)


ggplot(ratio_reg,aes(x=reorder(region_txt,r),y=r)) + 
  geom_bar(stat="identity") + 
  xlab("Country") + 
  ylab("deaths/events") + 
  coord_flip()

What is very surprising is that ,despite the fact that Middle East & North Africa and South Asia are the one with most high number of events and deaths, the region with the highest ratio is Sub-Saharan Africa.

To have a better understanding of the geographical distribution of the data we then decided to plot the locations with the number of deaths of the events of the type “Bombings/Explosion” (the most common events associated to a terroristic attack, being present in more than half of the events) for every decades:

worldmap <- shapefile("TM_WORLD_BORDERS-0.3.shp")
worldmap_df <- fortify(worldmap)

## Regions defined for each Polygons

#### Filtro para tipo de evento

globalterrorism$attacktype1_txt <- as.factor(globalterrorism$attacktype1_txt)
levels(globalterrorism$attacktype1_txt)

## [1] "Armed Assault"                      
## [2] "Assassination"                      
## [3] "Bombing/Explosion"                  
## [4] "Facility/Infrastructure Attack"     
## [5] "Hijacking"                          
## [6] "Hostage Taking (Barricade Incident)"
## [7] "Hostage Taking (Kidnapping)"        
## [8] "Unarmed Assault"                    
## [9] "Unknown"

#Bombing/Explosion
bombing <- globalterrorism[which(globalterrorism$attacktype1_txt == "Bombing/Explosion"),]
bombing_70_80 <- bombing[which(bombing$iyear>=1970 & bombing$iyear<1980),]
bombing_80_90 <- bombing[which(bombing$iyear>=1980 & bombing$iyear<1990),]
bombing_90_00 <- bombing[which(bombing$iyear>=1990 & bombing$iyear<2000),]
bombing_00_10 <- bombing[which(bombing$iyear>=2000 & bombing$iyear<2010),]
bombing_10_20 <- bombing[which(bombing$iyear>=2010 & bombing$iyear<2020),]


source("plot_world.R")
plot_world(worldmap_df,bombing_70_80,"Bombing 70/80")
plot_world(worldmap_df,bombing_80_90,"Bombing 80/90")
plot_world(worldmap_df,bombing_90_00,"Bombing 90/00")
plot_world(worldmap_df,bombing_00_10,"Bombing 00/10")
plot_world(worldmap_df,bombing_10_20,"Bombing 10/20")

Similarly we explored the feature regarding “Facility/Infrastructure Attack”:

FI_A <- globalterrorism[which(globalterrorism$attacktype1_txt == "Facility/Infrastructure Attack"),]
FI_A_70_80 <- FI_A[which(FI_A$iyear>=1970 & FI_A$iyear<1980),]
FI_A_80_90 <- FI_A[which(FI_A$iyear>=1980 & FI_A$iyear<1990),]
FI_A_90_00 <- FI_A[which(FI_A$iyear>=1990 & FI_A$iyear<2000),]
FI_A_00_10 <- FI_A[which(FI_A$iyear>=2000 & FI_A$iyear<2010),]
FI_A_10_20 <- FI_A[which(FI_A$iyear>=2010 & FI_A$iyear<2020),]

plot_world(worldmap_df,FI_A_70_80,"Facility/Infrastructure Attack 70/80")
plot_world(worldmap_df,FI_A_80_90,"Facility/Infrastructure Attack 80/90")
plot_world(worldmap_df,FI_A_90_00,"Facility/Infrastructure Attack 90/00")
plot_world(worldmap_df,FI_A_00_10,"Facility/Infrastructure Attack 00/10")
plot_world(worldmap_df,FI_A_10_20,"Facility/Infrastructure Attack 10/20")

All these geographical considerations are useful to get an general idea over the dataset and to visualize better some important facts contained in it.

Feature exploration

After having explored in a “geographical” context our data we felt the need to understand better other characteristics of the data such as type, target, and the weapon of the attack.

### ATTACK TYPE

attack <- rep(0, length(unique(globalterrorism$attacktype1_txt)))

for( i in 1:length(unique(globalterrorism$attacktype1_txt))){
  
  attack[i] <- nrow(globalterrorism[which(globalterrorism$attacktype1_txt==unique(globalterrorism$attacktype1_txt)[i]),])
  
}

attack <- data.frame(attack_type = unique(globalterrorism$attacktype1_txt),count = attack)
colnames(attack) <- c("attacktype1_txt","count")

attack$count <- as.numeric(as.character(attack$count))

attack_ord <- attack[order(attack$count,decreasing=TRUE),]

ggplot(attack_ord,aes(x=reorder(attacktype1_txt,count),y=count)) + 
  geom_bar(stat = "identity") + 
  xlab("Attack type") +
  coord_flip()

### Ratio
attack_deaths <- aggregate(nkill ~ attacktype1_txt, globalterrorism, FUN=sum)

ratio_att <- merge(attack,attack_deaths,by = "attacktype1_txt")
ratio_att <- data.frame(attacktype1_txt = ratio_att$attacktype1_txt, r = ratio_att$nkill/ratio_att$count)

ggplot(ratio_att,aes(x=reorder(attacktype1_txt,r),y=r)) + 
  geom_bar(stat = "identity") + 
  xlab("Attack type") +
  ylab("deaths/events") +
  coord_flip()

### Box-plot
ggplot(globalterrorism,aes(x=attacktype1_txt,y=nkill)) +
  geom_boxplot() +
  coord_flip()

## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).

ggplot(globalterrorism,aes(x=attacktype1_txt,y=nkill)) +
  geom_boxplot() +
  ylim(0,25) +
  coord_flip()

## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).

##### TARGET 

target <- rep(0, length(unique(globalterrorism$targtype1_txt)))

for( i in 1:length(unique(globalterrorism$targtype1_txt))){
  
  target[i] <- nrow(globalterrorism[which(globalterrorism$targtype1_txt==unique(globalterrorism$targtype1_txt)[i]),])
  
}

target <- data.frame(target_type = unique(globalterrorism$targtype1_txt), count = target)
colnames(target) <- c("targtype1_txt","count")

target$count <- as.numeric(as.character(target$count))

target_ord <- target[order(target$count,decreasing=TRUE),]

ggplot(target_ord,aes(x=reorder(targtype1_txt,count),y=count)) + 
  geom_bar(stat = "identity") + 
  xlab("Target") +
  coord_flip()

### Ratio
target_deaths <- aggregate(nkill ~ targtype1_txt, globalterrorism, FUN=sum)


ratio_trg <- merge(target,target_deaths,by = "targtype1_txt")
ratio_trg <- data.frame(targtype1_txt = ratio_trg$targtype1_txt, r = ratio_trg$nkill/ratio_trg$count)

ggplot(ratio_trg,aes(x=reorder(targtype1_txt,r),y=r)) + 
  geom_bar(stat = "identity") + 
  xlab("Target type") +
  ylab("deaths/events") +
  coord_flip()

###Boxplot
ggplot(globalterrorism,aes(x=targtype1_txt,y=nkill)) +
  geom_boxplot() +
  coord_flip()

## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).

ggplot(globalterrorism,aes(x=targtype1_txt,y=nkill)) +
  geom_boxplot() +
  ylim(0,25) +
  coord_flip()

## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).

#### WEAPON

weapon <- rep(0, length(unique(globalterrorism$weaptype1_txt)))

for( i in 1:length(unique(globalterrorism$weaptype1_txt))){
  
  weapon[i] <- nrow(globalterrorism[which(globalterrorism$weaptype1_txt==unique(globalterrorism$weaptype1_txt)[i]),])
  
}

weapon <- data.frame(weapon_type = unique(globalterrorism$weaptype1_txt), count = weapon)
colnames(weapon) <- c("weaptype1_txt","count")

weapon$count <- as.numeric(as.character(weapon$count))

weapon_ord <- weapon[order(weapon$count,decreasing=TRUE),]

ggplot(weapon_ord,aes(x=reorder(weaptype1_txt,count),y=count)) + 
  geom_bar(stat = "identity") + 
  xlab("Weapon") +
  coord_flip()

### Ratio
weapon_deaths <- aggregate(nkill ~ weaptype1_txt, globalterrorism, FUN=sum)


ratio_weap <- merge(weapon,weapon_deaths,by = "weaptype1_txt")
ratio_weap <- data.frame(weaptype1_txt = ratio_weap$weaptype1_txt, r = ratio_weap$nkill/ratio_weap$count)

ggplot(ratio_weap,aes(x=reorder(weaptype1_txt,r),y=r)) + 
  geom_bar(stat = "identity") + 
  xlab("weapon type") +
  ylab("deaths/events") +
  coord_flip()

###Boxplot
ggplot(globalterrorism,aes(x=weaptype1_txt,y=nkill)) +
  geom_boxplot() +
  coord_flip()

## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).

ggplot(globalterrorism,aes(x=weaptype1_txt,y=nkill)) +
  geom_boxplot() +
  ylim(0,25) +
  coord_flip()

## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).

What we can extract from the last part of our analysis is that the the phenomenon is very complex and has several aspects that must be considered when dealing with the study of it.

For example taking into account the type of the attack, we make ourselves aware of the fact that despite the hijackings are the last for number of events occurred they are first for the ratio deaths/events. The same happen for the kidnappings, while surprisingly the bombings happen to have a “lower” ratio. This is suggesting that certain types of events occur a lot but doesn’t have a high number of victims. Another fact to be noticed is that for the weapon used the vehicles are the ones with the highest ratio but with an “very low” number of occurencies.

For what concern the boxplots it is to highligth how nearly half of the events have low number of deaths when considering the attack type/target type/weapon type. Moreover the “high” number of deaths occuring happens to be at most a quarter of the whole number.

Conclusions hito I

Finally we think that this is a very good dataset over which discover and visualize many things. We could learn a lot preparing ourselves for more advanced procedures and methods. We also could learn how to manage a huge dataset and extract useful information. Not to be underestimated is also the possibility to work on the columns, as we simply did for the ratios.

After a meeting with the auxiliars we are taking into account the idea to change the dataset in favour of one more coherent with the course objectives, meaning an explicit question to answer via a prediction or classification. In spite of that, the feedback during our first presentation (given by the teacher) gave us new ideas to work through this DS. Two of the big questions to solve are:

-Present a cluster with different kind of terrorist events. -Predict some features of future terrorism events (location, number of deaths, etc..)

To achieve it we have further work to do on the terrorism data set.

Introduction hito II

After having explored the data and discovered interesting facts such as trends concerning attacks, weapons, targets and countries we decided to go deeper in our analysis. In particular in this part of the work we will show, we are going to perform some classification and clustering analysis. Concerning classification we will try to create mainly two model: the first able to predict what kind of attack has certain attributes and second one able to predict the region given the features. On the opposite what we will try to do with clustering is to discover unknown relationships between the data, such as the number of k divitions of the data. Afterwards we will procede to validate the clustering via comparison between the cluster and a cluster of random noise (comparing the correlation of matrices of incidence and distance, as we will show later on the report). ## Data pre-processing

In this part we are preparing the data for the whole analysis. Mainly it’s about dealing with Na’s and selecting the relevant features we are interested in. Firstly we filtered the data eliminating the rows that were not considered for sure as a terrorist attack. Then we proceeded by selecting the features of interests and creating a new binary variable to distinguish between events with deaths or not. Moreover we eliminated those observations having the number of deaths or wounded such that they could be outliers. The rest is explained pretty clearly in the code below:

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(123)

data <-read_csv("globalterrorismdb_0718dist.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   approxdate = col_character(),
##   resolution = col_character(),
##   country_txt = col_character(),
##   region_txt = col_character(),
##   provstate = col_character(),
##   city = col_character(),
##   location = col_character(),
##   summary = col_character(),
##   alternative_txt = col_character(),
##   attacktype1_txt = col_character(),
##   attacktype2_txt = col_character(),
##   attacktype3 = col_logical(),
##   attacktype3_txt = col_logical(),
##   targtype1_txt = col_character(),
##   targsubtype1_txt = col_character(),
##   corp1 = col_character(),
##   target1 = col_character(),
##   natlty1_txt = col_character(),
##   targtype2_txt = col_character(),
##   targsubtype2_txt = col_character()
##   # ... with 48 more columns
## )

## See spec(...) for full column specifications.

## Warning: 7113 parsing failures.
##  row             col           expected                                                                                                                                                                                     actual                             file
## 1687 ransomamtus     1/0/T/F/TRUE/FALSE 20000                                                                                                                                                                                      'globalterrorismdb_0718dist.csv'
## 1687 ransomnote      1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender.  However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3     1/0/T/F/TRUE/FALSE 2                                                                                                                                                                                          'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault                                                                                                                                                                              'globalterrorismdb_0718dist.csv'
## 3432 ransomnote      1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands.                                                                                            'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.

dim(data)

## [1] 181691    135

# doubtterr is a binary variable describing the doubts about the "reliability" of the event
df <- data %>% filter(doubtterr == 0   & nkill >= 0)

rm(data)

df <- df %>%
  
  dplyr::select(
    # Time
    iyear,   # Year
    imonth,  # Month
    iday,    # Day
    
    # Geospatial
    city,      # City
    latitude,  # Geo coordinate
    longitude, # Geo coordinate
    country,   # Country id
    region,    # Region id, see docs for details
    
    # Numerical
    nperps,    # Number of perpetrators
    nkill,     # Death toll
    nwound,    # No of casualties
    nkillter,  # Death toll - terrorist only
    
    # Binary
    vicinity,  # Event occured in city or next to it, binary          
    ishostkid, # Any hostages ?
    ransom,    # Any ransom demanded ? 
    extended,  # Event duration above 24hrs
    
    # Categorical
    country_txt,     # Desc country
    region_txt,      # Desc region  
    attacktype1_txt, # Desc attack type
    weaptype1_txt,   # Desc weapon type
    targtype1_txt,   # Desc target type
    propextent,      # Amount of damage done          
    
    # Text
    summary, # Motive text variable
    gname)  # Name of organization

long_name <- "Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Vehicle"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)

long_name <- "Explosives/Bombs/Dynamite"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Explosives"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)

### More reduced data

ml_df <- df %>% 
  # Initial filter
  dplyr::filter(nkill >= 0 & 
                  nwound >= 0 & 
                  (longitude >= -180 & longitude <= 200) &
                  (latitude  >= -90  & latitude  <= 90) & 
                  (imonth >= 1 & imonth <= 12)) %>%
  
  # Remove outliers
  dplyr::filter( !(abs(nkill-mean(nkill))/sd(nkill) > 4)) %>%
  dplyr::filter( !(abs(nwound-mean(nwound))/sd(nwound) > 4)) %>%
  
  # Clean dataset
  dplyr::mutate(
    # Fill with median
    #nperps = ifelse(is.na(nperps) | nperps < 1, perp_median, nperps),
    
    # Check binary variables
    vicinity = ifelse((vicinity != 0 & vicinity != 1) | is.na(vicinity), 0, vicinity),
    extended = ifelse((extended != 0 & extended != 1) | is.na(extended), 0, extended),
    ransom = ifelse((ransom != 0 & ransom != 1) | is.na(ransom), 0, ransom),
    ishostkid = ifelse((ishostkid != 0 & ishostkid != 1) | is.na(ishostkid), 0, ishostkid),
    propextent = ifelse( (propextent < 1 | propextent > 4) | is.na(propextent), 4, propextent)) %>%
  
  
  # Feature engineering
  dplyr::mutate(
     # Group known
     gknow = ifelse(gname != "Unknown", 1, 0),
 
     # Scale coordinates
      lat = as.vector(scale(latitude)),
      long = as.vector(scale(longitude))) %>%
  
  # Create binary target
  dplyr::mutate(target = ifelse(round(nkill, 0) > 0, 1, 0)) %>%
  
  # Recode vars, scale numericals, convert cat's to factors.
  dplyr::mutate(month = factor(imonth), 
                cnty  = factor(country_txt),
                rgn   = factor(region_txt),
                vnity = factor(vicinity), 
                ext   = factor(extended),
                rans  = factor(ransom),
                host  = factor(ishostkid),
                atype = factor(attacktype1_txt),
                wtype = factor(weaptype1_txt),
                ttype = factor(targtype1_txt),
                gknow = factor(gknow),
                prop  = factor(propextent),
                trgt  = factor(target),
                lat   = as.vector(scale(latitude)),
                long  = as.vector(scale(longitude))) %>% 
               
  # Select clean data
  dplyr::select(rgn, vnity, ext, rans, host, atype, wtype, ttype, 
                prop, target)

Classification analysis

As just explained in the introduction we first try to create a model to predict the attack type of an event based on all the features we previously selected such as: - region - weapon - target - killed (yes or no) - …

To do that in this analysis we are going to use Naive Bayes and Decision Trees algorithms as the following lines of code show:

#### CLASSIFICATION

library(caret)

## Loading required package: lattice

trainIndex=createDataPartition(ml_df$atype, p=0.7)$Resample1
train=ml_df[trainIndex, ]
test=ml_df[-trainIndex, ]

printALL=function(model,train_class,test_class){
  trainPred=predict(model, newdata = train, type = "class")
  trainTable=table(train_class, trainPred)
  testPred=predict(model, newdata=test, type="class")
  testTable=table(test_class, testPred)
  trainAcc=(sum(diag(trainTable)))/sum(trainTable)
  testAcc=(sum(diag(testTable)))/sum(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
}


# Naive Bayes

library(e1071)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:raster':
## 
##     interpolate

NBclassifier=naiveBayes(atype ~ . , data=train, laplace = 1,na.action = na.omit)

printALL(NBclassifier,train$atype,test$atype)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.855        0.855

# Decision Trees

library(rpart)
library(rpart.plot)

DTclassifier <- rpart(atype ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)

printALL(DTclassifier,train$atype,test$atype)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]          0.85        0.849

#SVM
library(e1071)
small_train <- train[sample(1:10000),,drop=FALSE]
SVM_Lin_Classifier <- svm(atype~., small_train,  method="C-classification", kernel="linear")
printALL(SVM_Lin_Classifier,train$atype,test$atype)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.773        0.772

SVM_Pol_Classifier <- svm(atype~., small_train,  method="C-classification", kernel="polynomial")
printALL(SVM_Pol_Classifier,train$atype,test$atype)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]          0.56         0.56

As we can see the two first classification methods work pretty well, giving an accuracy of 85% both for the training and testing sets. Particularly interesting is the Decision tree plot that shows exactly how the algorithm assign to a certain observation its class. For what concern the Naive Bayes algorithm some interesting things that can be seen are the prior probabilities and the conditional probabilities given the target class. With these two ingredients one can compute the posterior probabilities for the target class. For example the probability that an attack is a Bombing/Explosion given the fact that it take place in Middle East & North Africa is p(ME&NA|B/E)p(B/E) = 0.36 x 0.55 = 0.2. Another example is the probability that the event is a Facility/Infrastructure Attack given that there are no deaths (target = 0) is p(target = 0 |F/I A)p(F/I A) = 0.95 x 0.05 = 0.0475. We clearly see how the likelihood (what we actually see from the data) is updated through the prior to get the posterior. A 95% probability observing the data can be turned, thanks to Bayes rule, in a relatively small probability weighted on the priors. The last example we want to highlight in order to underline the the importance to use both the probabilities is the following: the probability of a Bombing/Explosion to happen given that it occurred in a city (vnity = 0) is 0.94 * 0.55 = 0.52.

In relation of SVM methods, use of liner kernel give better results than a polinomial kernel. But both results are worst than Niave Bayes and Decision Tree.

We were interested also in creating a classifier able to distinguish between the regions were an event takes place, and so we tried to perform the same algorithm described before also for this case:

# Naive Bayes

NBclassifier=naiveBayes(rgn ~ . , data=train, laplace = 1,na.action = na.omit)

printALL(NBclassifier,train$rgn,test$rgn)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.357        0.353

# Decision Trees

DTclassifier <- rpart(rgn ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)

printALL(DTclassifier,train$rgn,test$rgn)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.341        0.347

As it is possible to see from the accuracy, in this case the algorithm has less ability to predict the region where the event happen. Why is this the case? The answers could be a lot but we think that we are excluding some variables important to characterize the region of an observation. By the way we are pretty satisfied also in this case about the result obtained.

Also, the group use number of killed people as a variable to predict. But in a lot of cases there aren’t killed people in attacks. For that reason, we create a variable “target” with value 1 if the attack had kills and value 0 if attack hadn’t kills. We use “target” as a variable to predict using data mining methods.

set.seed(123)

train$target<-as.factor(train$target)
test$target<-as.factor(test$target)

# Naive Bayes
NBclassifier=naiveBayes(target ~ . , data=train, laplace = 1,na.action = na.omit)
printALL(NBclassifier,train$target,test$target)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.731        0.727

# Decision Trees

DTclassifier <- rpart(target ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)

printALL(DTclassifier,train$target,test$target)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.734        0.734

#SVM
small_train <- train[sample(1:10000),,drop=FALSE]
SVM_Lin_Classifier <- svm(target~., small_train,  method="C-classification", kernel="linear")
printALL(SVM_Lin_Classifier,train$target,test$target)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.704          0.7

SVM_Pol_Classifier <- svm(target~., small_train,  method="C-classification", kernel="polynomial")
printALL(SVM_Pol_Classifier,train$target,test$target)

## Accuracy

##      trainAccuracy testAccuracy
## [1,]         0.551        0.552

Just like the first analysis predicting attack type, we look better results in Naive Bayes and Decision Trees, but we look good results in SVM Model with Linear Kernel and that model is better than Polynomial Kernel SVM.

Clustering

in this section we will focus on clustering and validation of the respective results. First we install librearies. It worth to notice that in this section we will re define the data from the original.

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:raster':
## 
##     shift

library(mltools)

## 
## Attaching package: 'mltools'

## The following object is masked from 'package:e1071':
## 
##     skewness

library(cluster)
library(stats)
library(readr)
library(dplyr)

A quick View of the original data.

setwd("C:/Users/felip/Desktop/Análisis de Datos")
data<- read_csv("globalterrorismdb_0718dist.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   approxdate = col_character(),
##   resolution = col_character(),
##   country_txt = col_character(),
##   region_txt = col_character(),
##   provstate = col_character(),
##   city = col_character(),
##   location = col_character(),
##   summary = col_character(),
##   alternative_txt = col_character(),
##   attacktype1_txt = col_character(),
##   attacktype2_txt = col_character(),
##   attacktype3 = col_logical(),
##   attacktype3_txt = col_logical(),
##   targtype1_txt = col_character(),
##   targsubtype1_txt = col_character(),
##   corp1 = col_character(),
##   target1 = col_character(),
##   natlty1_txt = col_character(),
##   targtype2_txt = col_character(),
##   targsubtype2_txt = col_character()
##   # ... with 48 more columns
## )

## See spec(...) for full column specifications.

## Warning: 7113 parsing failures.
##  row             col           expected                                                                                                                                                                                     actual                             file
## 1687 ransomamtus     1/0/T/F/TRUE/FALSE 20000                                                                                                                                                                                      'globalterrorismdb_0718dist.csv'
## 1687 ransomnote      1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender.  However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3     1/0/T/F/TRUE/FALSE 2                                                                                                                                                                                          'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault                                                                                                                                                                              'globalterrorismdb_0718dist.csv'
## 3432 ransomnote      1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands.                                                                                            'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.

dim(data)

## [1] 181691    135

We will only leave the data corresponding to numerical values for the first clustering. We call this value “data1”

data1<-data[,c(2,3, 4, 6 ,8,10,16,17,20,21,22,23,24,26,27,28,29,31,33,35,37,41,43,45,49,51,53,57, 66,  68,  69 , 70,72  ,73,75 , 76,78,79 , 81,  82,84,86,88,90,92, 94,96,99, 100,101,103, 104, 105, 106, 110, 111, 112, 114, 117, 121,  123, 125, 131, 132, 133, 134 )] #We select cols matching the criteria

data1<- as.matrix(data1) #transform to matrix

head(data1)

##      iyear imonth iday extended country region specificity vicinity crit1
## [1,]  1970      7    2        0      58      2           1        0     1
## [2,]  1970      0    0        0     130      1           1        0     1
## [3,]  1970      1    0        0     160      5           4        0     1
## [4,]  1970      1    0        0      78      8           1        0     1
## [5,]  1970      1    0        0     101      4           1        0     1
## [6,]  1970      1    1        0     217      1           1        0     1
##      crit2 crit3 doubtterr alternative multiple success suicide
## [1,]     1     1         0          NA        0       1       0
## [2,]     1     1         0          NA        0       1       0
## [3,]     1     1         0          NA        0       1       0
## [4,]     1     1         0          NA        0       1       0
## [5,]     1     1        -9          NA        0       1       0
## [6,]     1     1         0          NA        0       1       0
##      attacktype1 attacktype2 attacktype3 targtype1 targsubtype1 natlty1
## [1,]           1          NA          NA        14           68      58
## [2,]           6          NA          NA         7           45      21
## [3,]           1          NA          NA        10           54     217
## [4,]           3          NA          NA         7           46     217
## [5,]           7          NA          NA         7           46     217
## [6,]           2          NA          NA         3           22     217
##      targtype2 targsubtype2 natlty2 targtype3 targsubtype3 natlty3
## [1,]        NA           NA      NA        NA           NA      NA
## [2,]        NA           NA      NA        NA           NA      NA
## [3,]        NA           NA      NA        NA           NA      NA
## [4,]        NA           NA      NA        NA           NA      NA
## [5,]        NA           NA      NA        NA           NA      NA
## [6,]        NA           NA      NA        NA           NA      NA
##      guncertain1 guncertain3 individual nperps claimed claimmode claim2
## [1,]           0          NA          0     NA      NA        NA     NA
## [2,]           0          NA          0      7      NA        NA     NA
## [3,]           0          NA          0     NA      NA        NA     NA
## [4,]           0          NA          0     NA      NA        NA     NA
## [5,]           0          NA          0     NA      NA        NA     NA
## [6,]           0          NA          0    -99       0        NA     NA
##      claimmode2 claim3 claimmode3 compclaim weaptype1 weapsubtype1
## [1,]         NA     NA         NA        NA        13           NA
## [2,]         NA     NA         NA        NA        13           NA
## [3,]         NA     NA         NA        NA        13           NA
## [4,]         NA     NA         NA        NA         6           16
## [5,]         NA     NA         NA        NA         8           NA
## [6,]         NA     NA         NA        NA         5            5
##      weaptype2 weapsubtype2 weaptype3 weapsubtype3 weaptype4 weapsubtype4
## [1,]        NA           NA        NA           NA        NA           NA
## [2,]        NA           NA        NA           NA        NA           NA
## [3,]        NA           NA        NA           NA        NA           NA
## [4,]        NA           NA        NA           NA        NA           NA
## [5,]        NA           NA        NA           NA        NA           NA
## [6,]        NA           NA        NA           NA        NA           NA
##      nkill nkillus nkillter nwoundus nwoundte property propextent
## [1,]     1      NA       NA       NA       NA        0         NA
## [2,]     0      NA       NA       NA       NA        0         NA
## [3,]     1      NA       NA       NA       NA        0         NA
## [4,]    NA      NA       NA       NA       NA        1         NA
## [5,]    NA      NA       NA       NA       NA        1         NA
## [6,]     0       0        0        0        0        1          3
##      ishostkid nhostkid nhostkidus ndays ransom ransompaidus
## [1,]         0       NA         NA    NA      0           NA
## [2,]         1        1          0    NA      1           NA
## [3,]         0       NA         NA    NA      0           NA
## [4,]         0       NA         NA    NA      0           NA
## [5,]         0       NA         NA    NA      0           NA
## [6,]         0       NA         NA    NA      0           NA
##      hostkidoutcome nreleased INT_LOG INT_IDEO INT_MISC INT_ANY
## [1,]             NA        NA       0        0        0       0
## [2,]             NA        NA       0        1        1       1
## [3,]             NA        NA      -9       -9        1       1
## [4,]             NA        NA      -9       -9        1       1
## [5,]             NA        NA      -9       -9        1       1
## [6,]             NA        NA      -9       -9        0      -9

dim(data1) #(checking the length of col)

## [1] 181691     66

Now there is an analysis of the NA´s existing in data1, it is donemanually, and bellow we present the number of NA´s in each col.

Cols in data1 with value==NA: (# of NA´s)

7 (6),12(1),13(152680),14(1),18(175377),19(181263),21(10373),22(1559),23(170547),24(171006),25(170863),26(180515),27(180594),28(180544),29(380),30(181371),32(71115),33(66120 ),34(162608),35(179801),36(181075),37(181373),38(181558),39(176852),41(20768),42(168564),43(170149),44(179828),45(179998),46(181618),47(181621),48(10313),49(64446),50(66958),51 (64702),52(69143),54(117626),55(178),56(168119),57(168174),58(173567),59(104310),60(181139 ),61(170700),62(171291)

Given the high number of NA´s in several cols, we impose a criteria where we delete any col with a number of NA´s higher than (approximately) 10% of the total of observations (more than 17000). Then, with the remaining cols any NA value is replaced with the mean of the col.

data2<-data1[,c(1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,20,21,22,29,31,40,48,53,55 )]

data3<-data1[,c(1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,20,21,22,29,31,40,48,53,55,63,64,65,66 )]


for(i in 1:ncol(data2)){
  data2[is.na(data2[,i]), i] <- mean(data2[,i], na.rm = TRUE)
}

for(i in 1:ncol(data3)){
  data3[is.na(data3[,i]), i] <- mean(data3[,i], na.rm = TRUE)
}

#summary(data2)
dim(data2)

## [1] 181691     25

#head(data3)
dim(data3)

## [1] 181691     29

We normalize each col, utilizing one of the normalization presented in class (range 0 to 1):

#summary(data2[,1])

data2[,1]=(data2[,1]-min(data2[,1]))/(max(data2[,1])-min(data2[,1]))
min(data2[,1]) #example shown for col number 2

## [1] 0

for (i in 1:25){
  data2[,i] <-(data2[,i]-min(data2[,i]))/(max(data2[,i])-min(data2[,i]))
}

for (i in 1:29){
  data3[,i] <-(data3[,i]-min(data3[,i]))/(max(data3[,i])-min(data3[,i]))
}

#apply the same for each col
summary(data2) #safety Check

##      iyear            imonth            iday           extended      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.4468   1st Qu.:0.3333   1st Qu.:0.2581   1st Qu.:0.00000  
##  Median :0.8298   Median :0.5000   Median :0.4839   Median :0.00000  
##  Mean   :0.6944   Mean   :0.5389   Mean   :0.5002   Mean   :0.04535  
##  3rd Qu.:0.9362   3rd Qu.:0.7500   3rd Qu.:0.7419   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     country          region        specificity        vicinity     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.074   1st Qu.:0.3636   1st Qu.:0.0000   1st Qu.:0.9000  
##  Median :0.094   Median :0.4545   Median :0.0000   Median :0.9000  
##  Mean   :0.128   Mean   :0.5601   Mean   :0.1129   Mean   :0.9068  
##  3rd Qu.:0.156   3rd Qu.:0.8182   3rd Qu.:0.0000   3rd Qu.:0.9000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      crit1            crit2            crit3          doubtterr     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.9000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.9000  
##  Mean   :0.9885   Mean   :0.9931   Mean   :0.8757   Mean   :0.8477  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.9000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     multiple         success          suicide         attacktype1    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.1250  
##  Median :0.0000   Median :1.0000   Median :0.00000   Median :0.2500  
##  Mean   :0.1378   Mean   :0.8896   Mean   :0.03651   Mean   :0.2809  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.2500  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##    targtype1        targsubtype1       natlty1        guncertain1     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.09524   1st Qu.:0.1964   1st Qu.:0.0790   1st Qu.:0.00000  
##  Median :0.14286   Median :0.3393   Median :0.1000   Median :0.00000  
##  Mean   :0.35427   Mean   :0.4105   Mean   :0.1237   Mean   :0.08144  
##  3rd Qu.:0.61905   3rd Qu.:0.6429   3rd Qu.:0.1640   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    individual        weaptype1          nkill              property     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.3333   1st Qu.:0.0000000   1st Qu.:0.9000  
##  Median :0.00000   Median :0.4167   Median :0.0006369   Median :1.0000  
##  Mean   :0.00295   Mean   :0.4539   Mean   :0.0015307   Mean   :0.8455  
##  3rd Qu.:0.00000   3rd Qu.:0.4167   3rd Qu.:0.0012739   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000000   Max.   :1.0000  
##    ishostkid     
##  Min.   :0.0000  
##  1st Qu.:0.9000  
##  Median :0.9000  
##  Mean   :0.9059  
##  3rd Qu.:0.9000  
##  Max.   :1.0000

#dim(data2)

As we can see now we have the normalized numerical Data!

We now proceed to use k means algorithm for clustering. We use k up to 10 cluster and check for the point where a higher k does not change the square sum of errors so drastically.

wss <- 0
clust <- 10 # graficaremos hasta 10 clusters

for (i in 1:clust){
  wss[i] <-
    sum(kmeans(data3, centers=i, nstart=20)$withinss)
}

plot(1:clust, wss, type="b", xlab="Numero de clusters", ylab="wss")

wss <- 0
clust <- 10 # graficaremos hasta 10 clusters

for (i in 1:clust){
  wss[i] <-
    sum(kmeans(data2, centers=i, nstart=20)$withinss)
}

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

plot(1:clust, wss, type="b", xlab="Numero de clusters", ylab="wss")

In this case it is clear that the best choice, given the criteria is k=2 (for data3, which was our simplest case of study). As shown later on, the critic elements (those who contributed the most for the clustering analizing data3) are removed and then we apply the clustering for data2.

the optimal number of clusters is no so clear for the data2, we proceed to check the results of cluster for different numbers of k (2,3 & 4).

km.out <- kmeans(data2, 3, nstart = 50)    #apply k-means with k=2
c=km.out$cluster    #save in variable c the target of each row, this will be used later.

km.out$size   #Check the size of each cluster

## [1] 81913 22590 77188

km.out <- kmeans(data3, 2, nstart = 50)    #apply k-means with k=2
c_29=km.out$cluster    #save in variable c the target of each row, this will be used later.

km.out$size   #Check the size of each cluster

## [1] 89045 92646

wss_29<-sum(kmeans(data3, 2, nstart=20)$withinss)  #Saves the SSE of the data, used later for validation
wss_29   #Check the SSE

## [1] 245529.9

Validation of K-means

compare SSE against Noise´s SSE.

We create a small matrix (1000x29) of aleatory noise to compare SSE of k menas with k=2. WE create the function Noisify to have random numbers in the range [0,1], matching the range of the normalized data. We do the same for a matrix matching the size of the normalized DS.

Noisify <- function(data) {

  if (is.vector(data)) {
    noise <- runif(length(data), 0, 1)
    noisified <- data + noise
  } else {
    length <- dim(data)[1] * dim(data)[2]
    noise <- matrix(runif(length, 0, 1), dim(data)[1])
    noisified <- data + noise
  }
  return(noisified)
}
   
a<-matrix(0, nrow = 1000, ncol = 25)

matrix_29<-matrix(0, nrow = 1000, ncol = 29)  #This works with the simple case

noise_a<-Noisify(a)

noise_matrix_29<-Noisify(matrix_29)

big_matrix<-matrix(0, nrow =181691 , ncol = 25)

big_noise<-Noisify(big_matrix)

big_matrix_29<-matrix(0, nrow =181691 , ncol = 29)

big_noise_29<-Noisify(big_matrix_29)

Now we Compare the SSE between several iterations of random noise against the obtanied with the data.

We apply the procedure to the simple case (data3), which we only need to check with k=2.

km_noise_29 <- kmeans(noise_matrix_29, 2, nstart = 50)
#km_noise

wss_n2_29<-sum(kmeans(big_noise_29, 2, nstart=20)$withinss) # we calculate the SSE for noise several times this time with k=4
wss_n2_29

## [1] 427540.8

#We obtain SSE from noise and compare them with SSe from the Data.
SSE_of_Noise_and_data3=c(427575.6,427572.7,245529.9,427575.6,427572.7, 427575.6,427572.7,427571.5,427866,427907.1,427725.4)

hist(SSE_of_Noise_and_data3)  #Histogram

This is the case for data2, with k=4.

km_noise <- kmeans(noise_a, 4, nstart = 50)
#km_noise

wss_n2<-sum(kmeans(big_noise, 4, nstart=20)$withinss) # we calculate the SSE for noise several times this time with k=4

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)

wss_n2

## [1] 355070.2

wss_k_m_4<-sum(kmeans(data2, 4, nstart=20)$withinss) # we calculate the SSE for our data
wss_k_m_4

## [1] 176570.9

#We obtain 10 SSE from noise and compare them with SSe from the Data.
wss_10=c(427655.5,427600,427706,427611.2, 427998.7,130562,428006.7,427579.1, 427613.7, 427638.4,427616.1)

SSE_k_4=c( 354713.9,355171.7,354772, 355783.7,176570.9,354885.4)

#hist(wss_10)  #histogram showing the diff. of SSE of several noise samples and the DS

hist(SSE_k_4)  #Histogram of the same procedure with k=4

We can see a notorious differences between SSE for noise and data in the both cases. Statistically significant!

Validation using Correlation

In this section we calculate the matrices of incidence & proximity, and then the correlation between them. this is applied to the normalized data and the generated noise. Because of the high computacional cost of this, given the high dim of the pre processed data (180000 x 29), we calculate in a sample (with dim of 1000x29). It´s worth notice that we keep the same proportion of classes obtained with k means.

a<-matrix(0, nrow = 1000, ncol = 29)  #usedto sample many times noise
noise_a<-Noisify(a)
km_noise_a <- kmeans(noise_a, 2, nstart = 50) #calculate k means for the noise
a=km_noise_a$cluster #save labels

j=cbind(noise_a,a)  # bind data / targets
       
sorted=j[order(j[,30]),] # order the data given the target
#head(sorted)  #Verify
sorted_a=sorted[,-30]   # delete the target
b=sorted[,30]      # save separately the targets
#head(sorted_a)      # Verify
incidence_m<-matrix(0, nrow = 1000, ncol = 1000)

now we check for the correlation between the two obtained matrices.

#Function to create an incidence matrix with 1s and 0´s
incidence<-function(incidence_m,a){
  for (i in 1:1000){
    for (j in 1:1000){
      if (a[i]==a[j]){
        incidence_m[i,j]<-1
      }else{
        incidence_m[i,j]<-0
      }
    }
  }
  return (incidence_m)
}

incidence_a=incidence(incidence_m,b)   #Incidence in noise
m_a <- as.matrix(dist(sorted_a))        #calculate distance matrix
#dim(incidence_a)                         #Verify dim
cor_a=cor(c(m_a),c(incidence_a))        # calculate correlation between matrices seen as vector.
cor_a

## [1] -0.1446389

#Now we create a a vector of several correlationn for different random noises

var_cor<- matrix(0, nrow = 1, ncol = 1)

for (i in 1:50){
  
a<-matrix(0, nrow = 1000, ncol = 29)  #usedto sample many times noise
noise_a<-Noisify(a)
km_noise_a <- kmeans(noise_a, 2, nstart = 50) #calculate k means for the noise
a=km_noise_a$cluster #save labels
j=cbind(noise_a,a)  # bind data / targets
sorted=j[order(j[,30]),] # order the data given the target
sorted_a=sorted[,-30]   # delete the target
b=sorted[,30]      # save separately the targets
incidence_m<-matrix(0, nrow = 1000, ncol = 1000)   
incidence_a=incidence(incidence_m,b)   #Incidence in noise
m_a <- as.matrix(dist(sorted_a))        #calculate distance matrix
cor_a=cor(c(m_a),c(incidence_a))        # calculate correlation between matrices seen as vector.
#var[i]=cor_a
var_cor=append(var_cor, cor_a)
i=i+1
if (i==50){
        break
  }
}

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)

#var_cor
#c(var_cor)

Apply the same procedure to a sample of the data.

#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k=cbind(data3,c)
#dim(k)
sorted_data=k[order(k[,30]),] # order according to the cluster target
#head(sorted_data)  #Verify
#sorted_data2=sorted_data[,-30]   # se elimina la etiqueta de cluster
#d=sorted_data[,30]      # se guarda como columna aparte las etiquetas ordenas
#head(d) 
#d[90000]

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]          #Second.

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
dim(sorted_data)#verify dim

## [1] 181691     30

samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000)  #we definy first and second cluster size
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 


#dim(sample_c1)
#dim(sample_c2) #verify

sample_data=rbind(sample_c1,sample_c2) #Unimos muestras

d=sample_data[,30]
sample_data2=sample_data[-30]

dim(sample_data2)   #Verify dimensionality

## NULL

verify correlation between matrices.

incidence_data<-matrix(0, nrow = 1000, ncol = 1000) # create matrix
#length(d) Check
incidence_sample_data=incidence(incidence_data,d)   #create incidence matrix.
#head(incidence_sample_data)
m_data <- as.matrix(dist(sample_data))        #Create distance matrix
m_data[1000,1000]               #Verify values for the diagonal (must be 1´s)

## [1] 0

cor_data=cor(c(m_data),c(incidence_sample_data))        # check correlation value
cor_data

## [1] -0.697568

We have very different values. We present an histogram (including various random noise samples) to check statistical significancy:

corr_10=append(-0.7989984,c(var_cor))

hist(corr_10)

WE can see a clear difference in the values!

We now proceed to check with

Checking features of each cluster

Proceding with data3, which contains 29 cols of info

Only has the case with k=2, because of the optimal case seen in the begging.

km.out <- kmeans(data3, 2, nstart = 50)    #apply k-means with k=2
c2=km.out$cluster    #save in variable c the target of each row, this will be used later.

km.out$size   #Check the size of each cluster

## [1] 89045 92646

#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k2=cbind(data3,c2)
#dim(k)
sorted_data=k2[order(k2[,30]),] # order according to the cluster target
#head(sorted_data)  #Verify
#sorted_data2=sorted_data[,-30]   # se elimina la etiqueta de cluster
#d=sorted_data[,30]      # se guarda como columna aparte las etiquetas ordenas
#head(d) 
#d[90000]

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]          #Second.

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
dim(sorted_data)#verify dim

## [1] 181691     30

samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000)  #we definy first and second cluster size
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 


dim(sample_c1)

## [1] 490  30

dim(sample_c2) #verify

## [1] 510  30

sample_data_k2=rbind(sample_c1,sample_c2) #Unimos muestras

d2=sample_data_k2[,30]
sample_data_2=sample_data_k2[,-30]

dim(sample_data_2)   #Verify dimensionality

## [1] 1000   29

library(knitr)
promedios_29_c<-matrix(0, nrow = 3, ncol = 29)
#mean(sample_c1[,1])
keys_29=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid","INT_LOG","INT_IDEO","INT_MISC","INT_ANY")
for (i in 1:29){
promedios_29_c[1,i]=keys_29[i]
promedios_29_c[2,i]=mean(sample_c1[,i])
promedios_29_c[3,i]=mean(sample_c2[,i])
  }
Tabla_29_C <- promedios_29_c
kable(Tabla_29_C)

iyear	imonth	iday	extended	country	region	specificity	vicinity	crit1	crit2	crit3	doubtterr	multiple	success	suicide	attacktype1	targtype1	targsubtype1	natlty1	guncertain1	individual	weaptype1	nkill	property	ishostkid	INT_LOG	INT_IDEO	INT_MISC	INT_ANY
0.655536257056014	0.525	0.514878209348255	0.0673469387755102	0.136608163265306	0.541372912801484	0.128571428571429	0.906326530612245	0.995918367346939	0.983673469387755	0.83469387755102	0.913877551020408	0.187755102040816	0.93469387755102	0.0469387755102041	0.285204081632653	0.340233236151604	0.408367370168044	0.121067468006455	0.155268245259949	0	0.457823129251701	0.00272860331529072	0.874489795918367	0.908163265306122	0.908367346938776	0.924897959183673	0.91265306122449	0.931836734693878
0.715978306216103	0.538888888888889	0.500948766603416	0.0313725490196078	0.121778431372549	0.568805704099822	0.0872549019607843	0.907647058823529	0.968627450980392	0.994117647058824	0.909803921568627	0.775882352941176	0.0705882352941176	0.872549019607843	0.0215686274509804	0.268137254901961	0.39094304388422	0.425855873423162	0.116484546365345	0.0137254901960784	0.00980392156862745	0.447385620915033	0.00126599042733592	0.859803921568628	0.905294117647059	0	0	0.906666666666667	0.101960784313725

We can see a notorious different in the means for the cols INT_LOG, INT_IDEO & INT_ANY for the clusters. So the clusters are mainly ruled by the difference of values of these cols. Given the information we wanted to do a sensibility analysis, removing these cols (these is the reason we created data2, removing them) and check the clusters formed by kmeans without these cols. Now we proceed to check the values of the means for clusters with k=2,3,4 (working on data2, which does not have the values INT).

verify correlation between matrices.

incidence_data<-matrix(0, nrow = 1000, ncol = 1000) # create matrix
#length(d) Check
incidence_sample_data=incidence(incidence_data,d)   #create incidence matrix.
#head(incidence_sample_data)
m_data <- as.matrix(dist(sample_data))        #Create distance matrix
m_data[1000,1000]               #Verify values for the diagonal (must be 1´s)

## [1] 0

cor_data=cor(c(m_data),c(incidence_sample_data))        # check correlation value
cor_data

## [1] -0.697568

We have very different values. We present an histogram (including various random noise samples) to check statistical significancy:

corr_10=append(-0.7989984,c(var_cor))

hist(corr_10)

WE can see a clear difference in the values!

#For bigger Samples

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]          #Second.

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
dim(sorted_data)#verify dim

## [1] 181691     30

samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*5000)  #we definy first and second cluster size SIMILAR from earlier on the code but we now have a sample size of 5k
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*5000)+1

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 

#dim(sample_c1)
#dim(sample_c2) #verify

sample_data=rbind(sample_c1,sample_c2) #Unimos muestras
sample_data<-as.data.frame(sample_data)
dim(sample_data)  #Check Dimensionailty

## [1] 5000   30

km.out <- kmeans(data2, 2, nstart = 50)    #apply k-means with k=2
c2=km.out$cluster    #save in variable c the target of each row, this will be used later.

km.out$size   #Check the size of each cluster

## [1] 103058  78633

#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k2=cbind(data2,c2)
#dim(k)
sorted_data=k2[order(k2[,26]),] # order according to the cluster target
#head(sorted_data)  #Verify
#sorted_data2=sorted_data[,-30]   # se elimina la etiqueta de cluster
#d=sorted_data[,30]      # se guarda como columna aparte las etiquetas ordenas
#head(d) 
#d[90000]

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]          #Second.

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
dim(sorted_data)#verify dim

## [1] 181691     26

samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000)  #we definy first and second cluster size
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 


dim(sample_c1)

## [1] 567  26

dim(sample_c2) #verify

## [1] 433  26

sample_data_k2=rbind(sample_c1,sample_c2) #Unimos muestras

d2=sample_data_k2[,26]
sample_data_2=sample_data_k2[,-26]

dim(sample_data_2)   #Verify dimensionality

## [1] 1000   25

library(knitr)
promedios_1<-matrix(0, nrow = 3, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_1[1,i]=keys_25[i]
promedios_1[2,i]=mean(sample_c1[,i])
promedios_1[3,i]=mean(sample_c2[,i])
  }
Tabla <- promedios_1
kable(Tabla)

iyear	imonth	iday	extended	country	region	specificity	vicinity	crit1	crit2	crit3	doubtterr	multiple	success	suicide	attacktype1	targtype1	targsubtype1	natlty1	guncertain1	individual	weaptype1	nkill	property	ishostkid
0.654283462794101	0.536743092298648	0.483814075211925	0.0229276895943563	0.129968253968254	0.535674202340869	0.107583774250441	0.908289241622575	1	0.998236331569665	0.798941798941799	0.831922398589065	0.0864197530864197	0.894179894179894	0.0352733686067019	0.290564373897707	0.0980095742000504	0.209616505004635	0.123239858906526	0.0740740740740741	0.0017636684303351	0.462815990593768	0.00161977186168926	0.837918871252205	0.896483078240124
0.733870571470689	0.553117782909931	0.514639052372793	0.0600461893764434	0.133644341801386	0.610959479319756	0.109699769053118	0.908083140877598	0.976905311778291	0.986143187066975	1	0.851732101616628	0.131639722863741	0.886836027713626	0.0300230946882217	0.288394919168591	0.70130869899923	0.683929199070698	0.120040396139572	0.0900692840646651	0	0.454003079291763	0.00163964315350063	0.828637413394919	0.90554272517321

Most notable differences for TargetType1, targsubtype1.

Apply the same to k=3

km.out <- kmeans(data2, 3, nstart = 50)    #apply k-means with k=3
c3=km.out$cluster    #save in variable c the target of each row, this will be used later.

km.out$size   #Check the size of each cluster

## [1] 81913 77188 22590

#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k3=cbind(data2,c3)
#dim(k)
sorted_data=k3[order(k3[,26]),] # order according to the cluster target
#head(sorted_data)  #Verify
#sorted_data2=sorted_data[,-30]   # se elimina la etiqueta de cluster
#d=sorted_data[,30]      # se guarda como columna aparte las etiquetas ordenas
#head(d) 
#d[90000]

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]         #Second.
size_clust3=km.out$size[3]         #Third

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
c3_sorted_data=sorted_data[(size_clust1+size_clust2-1):(size_clust1+size_clust2+size_clust3-1),]         #Data from the third
dim(sorted_data)#verify dim

## [1] 181691     26

samples_c1=trunc(size_clust1/(size_clust1+size_clust2+size_clust3)*1000)  #we definy clusters reduced size
samples_c2=trunc(size_clust2/(size_clust1+size_clust2+size_clust3)*1000)+1
samples_c3=trunc(size_clust3/(size_clust1+size_clust2+size_clust3)*1000)+1  


samples_c1+samples_c2+samples_c3  #Verify sum=1000

## [1] 1000

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 
sample_c3=c3_sorted_data[sample(nrow(c3_sorted_data), samples_c3,replace = FALSE),]


#dim(sample_c1)
#dim(sample_c2) #verify

sample_data_k3=rbind(sample_c1,sample_c2,sample_c3) #Unimos muestras

d3=sample_data_k3[,26]
sample_data_3=sample_data_k3[,-26]

dim(sample_data_3)   #Verify dimensionality

## [1] 1000   25

promedios_2<-matrix(0, nrow = 4, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_2[1,i]=keys_25[i]
promedios_2[2,i]=mean(sample_c1[,i])
promedios_2[3,i]=mean(sample_c2[,i])
promedios_2[4,i]=mean(sample_c3[,i])
  }
Tabla2 <- promedios_2
kable(Tabla2)

iyear	imonth	iday	extended	country	region	specificity	vicinity	crit1	crit2	crit3	doubtterr	multiple	success	suicide	attacktype1	targtype1	targsubtype1	natlty1	guncertain1	individual	weaptype1	nkill	property	ishostkid
0.65806146572104	0.546481481481481	0.508888888888889	0.0333333333333333	0.133968888888889	0.516161616161616	0.0844444444444444	0.906222222222222	0.991111111111111	0.997777777777778	1	0.824888888888889	0.117777777777778	0.893333333333333	0.0311111111111111	0.279722222222222	0.095026455026455	0.199355457533998	0.12306596986901	0.0577777777777778	0.00222222222222222	0.457777777777778	0.00110214037321241	0.878222222222222	0.902248468276224
0.742127659574468	0.512745098039216	0.504667931688805	0.0658823529411765	0.123632941176471	0.59379679144385	0.127647058823529	0.905176470588235	0.976470588235294	0.985882352941176	1	0.825411764705882	0.141176470588235	0.891764705882353	0.0282352941176471	0.291764705882353	0.693557422969188	0.669300164046987	0.118672622142455	0.0756773655405872	0.00470588235294118	0.468235294117647	0.00167983812749427	0.833647058823529	0.908941176470588
0.730893617021277	0.546	0.507612903225806	0	0.13728	0.576	0.12	0.9112	1	1	0	0.9992	0.096	0.912	0.04	0.285	0.142857142857143	0.2705693910072	0.13572	0.0246515214189983	0.008	0.456	0.00214851369264336	0.8376	0.9008

the cluster 3 has small values for extended, crit3 and targettype1. We see again different values for the clusters on target type1 and targetsubtype.

apply the same to k=4

km.out <- kmeans(data2, 4, nstart = 50)    #apply k-means with k=4
c4=km.out$cluster    #save in variable c4 the target of each row, this will be used later.
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k4=cbind(data2,c4)
#dim(k)
sorted_data=k4[order(k4[,26]),] # order according to the cluster target
#head(sorted_data)  #Verify
#sorted_data2=sorted_data[,-30]   # se elimina la etiqueta de cluster
#d=sorted_data[,30]      # se guarda como columna aparte las etiquetas ordenas
#head(d) 
#d[90000]

#Sample with k=2
size_clust1=km.out$size[1]         #First clust size
size_clust2=km.out$size[2]         #Second.
size_clust3=km.out$size[3]         #Third
size_clust4=km.out$size[4]         #Fourth.

c1_sorted_data=sorted_data[1:size_clust1,]         #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),]   #data from second
c3_sorted_data=sorted_data[(size_clust1+size_clust2-1):(size_clust1+size_clust2+size_clust3-1),]         #Data from the third
c4_sorted_data=sorted_data[(size_clust1+size_clust2+size_clust3):(size_clust1+size_clust2+size_clust3+size_clust4-1),]   #data from second
dim(sorted_data)#verify dim

## [1] 181691     26

samples_c1=trunc(size_clust1/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)  #we definy clusters reduced size
samples_c2=trunc(size_clust2/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)+1
samples_c3=trunc(size_clust3/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)  
samples_c4=trunc(size_clust4/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)+1


samples_c1+samples_c2+samples_c3+samples_c4  #Verify sum=1000

## [1] 1000

sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),] 
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),] 
sample_c3=c3_sorted_data[sample(nrow(c3_sorted_data), samples_c3,replace = FALSE),] 
sample_c4=c4_sorted_data[sample(nrow(c4_sorted_data), samples_c4,replace = FALSE),] 


#dim(sample_c1)
#dim(sample_c2) #verify

sample_data_k4=rbind(sample_c1,sample_c2,sample_c3,sample_c4) #Unimos muestras

d4=sample_data_k4[,26]
sample_data_4=sample_data_k4[,-26]

dim(sample_data_4)   #Verify dimensionality

## [1] 1000   25

promedios_3<-matrix(0, nrow = 5, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_3[1,i]=keys_25[i]
promedios_3[2,i]=mean(sample_c1[,i])
promedios_3[3,i]=mean(sample_c2[,i])
promedios_3[4,i]=mean(sample_c3[,i])
promedios_3[5,i]=mean(sample_c4[,i])
  }
Tabla3 <- promedios_3
kable(Tabla3)

iyear	imonth	iday	extended	country	region	specificity	vicinity	crit1	crit2	crit3	doubtterr	multiple	success	suicide	attacktype1	targtype1	targsubtype1	natlty1	guncertain1	individual	weaptype1	nkill	property	ishostkid
0.721841567496093	0.556261770244821	0.490978676872608	0.0480225988700565	0.12869209039548	0.567282999486389	0.118962889610425	0.908757062146893	0.985875706214689	0.985875706214689	1	0.824576271186441	0	0.827683615819209	0.0310734463276836	0.278954802259887	0.696529459241324	0.663934821646403	0.114831896972979	0.0743665556765513	0.00564971751412429	0.461158192090395	0.00160794230392344	0.853389830508475	0.907942968147742
0.644208037825059	0.532828282828283	0.504398826979472	0.047979797979798	0.123280303030303	0.538337924701561	0.106691919191919	0.907575757575758	0.98989898989899	0.997474747474748	1	0.819191919191919	0	0.911616161616162	0.0353535353535354	0.262626262626263	0.0983645983645984	0.207510711122689	0.126468905154178	0.0757575757575758	0	0.437079124579125	0.00115815654069343	0.860858585858586	0.906313131313131
0.692004118050789	0.50739247311828	0.491155046826223	0.0161290322580645	0.135379032258065	0.565982404692082	0.189516129032258	0.912903225806452	1	1	0	1	0.0725806451612903	0.919354838709677	0.0483870967741935	0.282258064516129	0.142857142857143	0.272029302145852	0.136975806451613	0.024850324011087	0	0.446236559139785	0.00312639106318708	0.818548387096774	0.904032258064516
0.754643701452212	0.541666666666667	0.516385048643113	0.0238095238095238	0.13152380952381	0.542568542568543	0.136904761904762	0.906349206349206	1	0.992063492063492	1	0.865079365079365	1	0.920634920634921	0.0396825396825397	0.334325396825397	0.473922902494331	0.506005050092627	0.126360102238958	0.0641384141061491	0	0.462301587301587	0.000666295342216777	0.864285714285714	0.898412698412698

Visualization of two clusters based on features comparison:

We see some differences between both clusters, but there are only two variables (INT_LOG and INT_IDEO) influence the difference of points on two clusters.

Silhouette Analysis for validation of clusters

We will see clustering analysis with silhouette score method. That method needs a sample of original data.

library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

set.seed(123)
sample_data2=rbind(sample_c1,sample_c2)
sample_data2<-as.data.frame(sample_data2)
fviz_nbclust(sample_data2, kmeans, method='silhouette')

The optimal number of clusters is 2. Now, we will see clustering assignment with 2 clusters in a sample of the data (5k), keeping the same proportion of classes from earlier on the code.

In the last table we can look differences between both clusters. Observing the last analysis, it can be concluded that there are no visible characteristics of both types of clusters, for the silhouette score. The model performed only discriminates two variables(“INT_LOG” and “INT_IDEO”), and determines that they have greater separation in the data and fix the existing clusters.

Association rules

After having explored the classification and clustering methods and learnt a lot about our data the last thing we’re going to do is studying if there are certain pattern. For this we are going to investigate using the a-priori algorithm. We are working with the usual data:

library(readr)
library(dplyr)
set.seed(123)

setwd("C:/Users/felip/Desktop/Análisis de Datos")
data <- read_csv("globalterrorismdb_0718dist.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   approxdate = col_character(),
##   resolution = col_character(),
##   country_txt = col_character(),
##   region_txt = col_character(),
##   provstate = col_character(),
##   city = col_character(),
##   location = col_character(),
##   summary = col_character(),
##   alternative_txt = col_character(),
##   attacktype1_txt = col_character(),
##   attacktype2_txt = col_character(),
##   attacktype3 = col_logical(),
##   attacktype3_txt = col_logical(),
##   targtype1_txt = col_character(),
##   targsubtype1_txt = col_character(),
##   corp1 = col_character(),
##   target1 = col_character(),
##   natlty1_txt = col_character(),
##   targtype2_txt = col_character(),
##   targsubtype2_txt = col_character()
##   # ... with 48 more columns
## )

## See spec(...) for full column specifications.

## Warning: 7113 parsing failures.
##  row             col           expected                                                                                                                                                                                     actual                             file
## 1687 ransomamtus     1/0/T/F/TRUE/FALSE 20000                                                                                                                                                                                      'globalterrorismdb_0718dist.csv'
## 1687 ransomnote      1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender.  However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3     1/0/T/F/TRUE/FALSE 2                                                                                                                                                                                          'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault                                                                                                                                                                              'globalterrorismdb_0718dist.csv'
## 3432 ransomnote      1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands.                                                                                            'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.

dim(data)

## [1] 181691    135

# doubtterr is a binary variable describing the doubts about the "reliability" of the event
df <- data %>% filter(doubtterr == 0   & nkill >= 0)

rm(data)

df <- df %>%
  
  dplyr::select(
    # Time
    iyear,   # Year
    imonth,  # Month
    iday,    # Day
    
    # Geospatial
    city,      # City
    latitude,  # Geo coordinate
    longitude, # Geo coordinate
    country,   # Country id
    region,    # Region id, see docs for details
    
    # Numerical
    nperps,    # Number of perpetrators
    nkill,     # Death toll
    nwound,    # No of casualties
    nkillter,  # Death toll - terrorist only
    
    # Binary
    vicinity,  # Event occured in city or next to it, binary          
    ishostkid, # Any hostages ?
    ransom,    # Any ransom demanded ? 
    extended,  # Event duration above 24hrs
    
    # Categorical
    country_txt,     # Desc country
    region_txt,      # Desc region  
    attacktype1_txt, # Desc attack type
    weaptype1_txt,   # Desc weapon type
    targtype1_txt,   # Desc target type
    propextent,      # Amount of damage done          
    
    # Text
    summary, # Motive text variable
    gname)  # Name of organization

long_name <- "Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Vehicle"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)

long_name <- "Explosives/Bombs/Dynamite"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Explosives"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)

### More reduced data

ml_df <- df %>% 
  # Initial filter
  dplyr::filter(nkill >= 0 & 
                  nwound >= 0 & 
                  (longitude >= -180 & longitude <= 200) &
                  (latitude  >= -90  & latitude  <= 90) & 
                  (imonth >= 1 & imonth <= 12)) %>%
  
  # Remove outliers
  dplyr::filter( !(abs(nkill-mean(nkill))/sd(nkill) > 4)) %>%
  dplyr::filter( !(abs(nwound-mean(nwound))/sd(nwound) > 4)) %>%
  
  # Clean dataset
  dplyr::mutate(
    # Fill with median
    #nperps = ifelse(is.na(nperps) | nperps < 1, perp_median, nperps),
    
    # Check binary variables
    vicinity = ifelse((vicinity != 0 & vicinity != 1) | is.na(vicinity), 0, vicinity),
    extended = ifelse((extended != 0 & extended != 1) | is.na(extended), 0, extended),
    ransom = ifelse((ransom != 0 & ransom != 1) | is.na(ransom), 0, ransom),
    ishostkid = ifelse((ishostkid != 0 & ishostkid != 1) | is.na(ishostkid), 0, ishostkid),
    propextent = ifelse( (propextent < 1 | propextent > 4) | is.na(propextent), 4, propextent)) %>%
  
  
  # Feature engineering
  dplyr::mutate(
    # Group known
    gknow = ifelse(gname != "Unknown", 1, 0),
    
    # Scale coordinates
    lat = as.vector(scale(latitude)),
    long = as.vector(scale(longitude))) %>%
  
  # Create binary target
  dplyr::mutate(target = ifelse(round(nkill, 0) > 0, 1, 0)) %>%
  
  # Recode vars, scale numericals, convert cat's to factors.
  dplyr::mutate(month = factor(imonth), 
                cnty  = factor(country_txt),
                rgn   = factor(region_txt),
                vnity = factor(vicinity), 
                ext   = factor(extended),
                rans  = factor(ransom),
                host  = factor(ishostkid),
                atype = factor(attacktype1_txt),
                wtype = factor(weaptype1_txt),
                ttype = factor(targtype1_txt),
                gknow = factor(gknow),
                gname = factor(gname),
                prop  = factor(propextent),
                trgt  = factor(target),
                lat   = as.vector(scale(latitude)),
                long  = as.vector(scale(longitude))) %>% 
  
  # Select clean data
  dplyr::select(rgn, wtype, atype, ttype, trgt, gname)

In order to work with the algorithm in r we need to convert the data into a “transaction” object:

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

## Loading required package: grid

tData <- as(ml_df, "transactions")

Now we are ready to explore the world of association rules. We will begin the analysis with a brief look at the frequency of the item sets:

frequentItems <- eclat(tData, parameter = list(supp = 0.5, maxlen = 5)) # calculates support for frequent items

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE     0.5      1      5 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 62076 
## 
## create itemset ... 
## set transactions ...[2957 item(s), 124152 transaction(s)] done [0.16s].
## sorting and recoding items ... [3 item(s)] done [0.02s].
## creating bit matrix ... [3 row(s), 124152 column(s)] done [0.00s].
## writing  ... [4 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(frequentItems)

##     items                                      support   count
## [1] {wtype=Explosives,atype=Bombing/Explosion} 0.5426493 67371
## [2] {wtype=Explosives}                         0.5748598 71370
## [3] {trgt=0}                                   0.5512275 68436
## [4] {atype=Bombing/Explosion}                  0.5500677 68292

itemFrequencyPlot(tData, topN=10, type="absolute", main="Item Frequency") # plot frequent items

From the histogram is possible to see the frequency of the itemsets and this will be particularly useful in the future analysis. So what comes next? Let’s see the first analysis what results give:

rules <- apriori (tData, parameter = list(supp = 0.1, conf = 0.5, minlen = 2, maxlen=5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target   ext
##       5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 12415 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2957 item(s), 124152 transaction(s)] done [0.15s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [78 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].

rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                 rhs                  support confidence     lift count
## [1] {rgn=Middle East & North Africa,                                                          
##      atype=Bombing/Explosion,                                                                 
##      gname=Unknown}                  => {wtype=Explosives} 0.1366551  0.9970616 1.734443 16966
## [2] {atype=Bombing/Explosion,                                                                 
##      trgt=1,                                                                                  
##      gname=Unknown}                  => {wtype=Explosives} 0.1075698  0.9950823 1.731000 13355
## [3] {rgn=Middle East & North Africa,                                                          
##      atype=Bombing/Explosion,                                                                 
##      trgt=1}                         => {wtype=Explosives} 0.1019637  0.9945789 1.730124 12659
## [4] {rgn=Middle East & North Africa,                                                          
##      atype=Bombing/Explosion}        => {wtype=Explosives} 0.1951318  0.9945400 1.730056 24226
## [5] {rgn=South Asia,                                                                          
##      atype=Bombing/Explosion}        => {wtype=Explosives} 0.1437351  0.9938182 1.728801 17845
## [6] {atype=Bombing/Explosion,                                                                 
##      gname=Unknown}                  => {wtype=Explosives} 0.2989883  0.9935494 1.728333 37120

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                          rhs                                support confidence     lift count
## [1] {atype=Armed Assault,                                                                            
##      trgt=1}                  => {wtype=Firearms}                 0.1277225  0.9152141 3.214123 15857
## [2] {wtype=Firearms,                                                                                 
##      trgt=1}                  => {atype=Armed Assault}            0.1277225  0.6245372 3.197952 15857
## [3] {wtype=Firearms}          => {atype=Armed Assault}            0.1702188  0.5977880 3.060982 21133
## [4] {atype=Armed Assault}     => {wtype=Firearms}                 0.1702188  0.8716077 3.060982 21133
## [5] {wtype=Explosives,                                                                               
##      atype=Bombing/Explosion,                                                                        
##      trgt=1}                  => {rgn=Middle East & North Africa} 0.1019637  0.5498653 1.894829 12659
## [6] {atype=Bombing/Explosion,                                                                        
##      trgt=1}                  => {rgn=Middle East & North Africa} 0.1025195  0.5459381 1.881295 12728

At a first glance it’s immediatly clear that the attack type “Bombing/Explosion” (and consequently weapon type “Explosives”) dominates the rules pattern and for this in the next analysis we will drop the observation regarding these attacks.

What is useful to look at is redundant rules. These rules are subsets of larger rules and thus can be dropped from the rules given by the algorithm.

#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)

## [1] 65

rules <- rules[-subsetRules] # remove subset rules.

rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                    rhs                         support confidence     lift count
## [1] {wtype=Firearms}                    => {trgt=1}                  0.2045074  0.7182055 1.600378 25390
## [2] {atype=Armed Assault}               => {trgt=1}                  0.1395547  0.7145921 1.592326 17326
## [3] {rgn=Middle East & North Africa}    => {wtype=Explosives}        0.2045477  0.7048684 1.226157 25395
## [4] {rgn=Middle East & North Africa}    => {atype=Bombing/Explosion} 0.1962030  0.6761130 1.229145 24359
## [5] {rgn=Middle East & North Africa}    => {gname=Unknown}           0.1848379  0.6369490 1.361521 22948
## [6] {ttype=Private Citizens & Property} => {trgt=1}                  0.1511695  0.5874914 1.309107 18768

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                                    rhs                         support confidence     lift count
## [1] {wtype=Firearms}                    => {trgt=1}                  0.2045074  0.7182055 1.600378 25390
## [2] {atype=Armed Assault}               => {trgt=1}                  0.1395547  0.7145921 1.592326 17326
## [3] {rgn=Middle East & North Africa}    => {gname=Unknown}           0.1848379  0.6369490 1.361521 22948
## [4] {ttype=Private Citizens & Property} => {trgt=1}                  0.1511695  0.5874914 1.309107 18768
## [5] {rgn=Middle East & North Africa}    => {trgt=1}                  0.1627682  0.5608971 1.249847 20208
## [6] {rgn=Middle East & North Africa}    => {atype=Bombing/Explosion} 0.1962030  0.6761130 1.229145 24359

Dropping the redundant rules seems to fix in some way the dependance on the event “Explosion/Bombing”. From the lists above is possible to see the most confidential rules and the ones with higher lift.
Let’s have a look on what happens if we drop it.

We are going to investigate what are the rules without that kind of event:

tData_2 <- as(ml_df[which(ml_df$atype!="Bombing/Explosion"),], "transactions")

frequentItems <- eclat(tData_2, parameter = list(supp = 0.5, maxlen = 5)) # calculates support for frequent items

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE     0.5      1      5 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 27930 
## 
## create itemset ... 
## set transactions ...[2113 item(s), 55860 transaction(s)] done [0.07s].
## sorting and recoding items ... [2 item(s)] done [0.02s].
## creating bit matrix ... [2 row(s), 55860 column(s)] done [0.00s].
## writing  ... [2 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(frequentItems)

##     items            support   count
## [1] {wtype=Firearms} 0.6246867 34895
## [2] {trgt=1}         0.5800573 32402

itemFrequencyPlot(tData_2, topN=10, type="absolute", main="Item Frequency") # plot frequent items

rules <- apriori (tData_2, parameter = list(supp = 0.1, conf = 0.5, minlen = 2, maxlen=5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target   ext
##       5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5586 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2113 item(s), 55860 transaction(s)] done [0.07s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [63 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                       rhs                support confidence     lift count
## [1] {wtype=Incendiary,                                                                            
##      atype=Facility/Infrastructure Attack} => {trgt=0}         0.1011636  0.9701288 2.310145  5651
## [2] {atype=Facility/Infrastructure Attack} => {trgt=0}         0.1253491  0.9521349 2.267297  7002
## [3] {wtype=Incendiary}                     => {trgt=0}         0.1202649  0.9475317 2.256336  6718
## [4] {atype=Armed Assault,                                                                         
##      trgt=1,                                                                                      
##      gname=Unknown}                        => {wtype=Firearms} 0.1230755  0.9464482 1.515077  6875
## [5] {rgn=South Asia,                                                                              
##      atype=Armed Assault,                                                                         
##      trgt=1}                               => {wtype=Firearms} 0.1028285  0.9228792 1.477347  5744
## [6] {atype=Armed Assault,                                                                         
##      trgt=1}                               => {wtype=Firearms} 0.2838704  0.9152141 1.465077 15857

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                                       rhs                                      support confidence     lift count
## [1] {wtype=Incendiary,                                                                                                  
##      trgt=0}                               => {atype=Facility/Infrastructure Attack} 0.1011636  0.8411730 6.389437  5651
## [2] {atype=Facility/Infrastructure Attack,                                                                              
##      trgt=0}                               => {wtype=Incendiary}                     0.1011636  0.8070551 6.358547  5651
## [3] {wtype=Incendiary}                     => {atype=Facility/Infrastructure Attack} 0.1042786  0.8215797 6.240609  5825
## [4] {atype=Facility/Infrastructure Attack} => {wtype=Incendiary}                     0.1042786  0.7920859 6.240609  5825
## [5] {wtype=Incendiary,                                                                                                  
##      atype=Facility/Infrastructure Attack} => {trgt=0}                               0.1011636  0.9701288 2.310145  5651
## [6] {atype=Facility/Infrastructure Attack} => {trgt=0}                               0.1253491  0.9521349 2.267297  7002

#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)

## [1] 44

rules <- rules[-subsetRules] # remove subset rules.

rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                       rhs              support  
## [1] {atype=Facility/Infrastructure Attack} => {trgt=0}         0.1253491
## [2] {wtype=Incendiary}                     => {trgt=0}         0.1202649
## [3] {atype=Assassination}                  => {trgt=1}         0.1754565
## [4] {ttype=Police}                         => {wtype=Firearms} 0.1437701
## [5] {ttype=Police}                         => {trgt=1}         0.1384891
## [6] {atype=Assassination}                  => {wtype=Firearms} 0.1650555
##     confidence lift     count
## [1] 0.9521349  2.267297 7002 
## [2] 0.9475317  2.256336 6718 
## [3] 0.7378604  1.272048 9801 
## [4] 0.7332238  1.173746 8031 
## [5] 0.7062905  1.217622 7736 
## [6] 0.6941203  1.111149 9220

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                                       rhs                  
## [1] {atype=Facility/Infrastructure Attack} => {trgt=0}             
## [2] {wtype=Incendiary}                     => {trgt=0}             
## [3] {ttype=Police}                         => {atype=Armed Assault}
## [4] {rgn=Middle East & North Africa}       => {gname=Unknown}      
## [5] {atype=Assassination}                  => {trgt=1}             
## [6] {ttype=Police}                         => {trgt=1}             
##     support   confidence lift     count
## [1] 0.1253491 0.9521349  2.267297 7002 
## [2] 0.1202649 0.9475317  2.256336 6718 
## [3] 0.1259040 0.6421072  1.479341 7033 
## [4] 0.1061941 0.5083555  1.370499 5932 
## [5] 0.1754565 0.7378604  1.272048 9801 
## [6] 0.1384891 0.7062905  1.217622 7736

What comes out of the previous analysis is that the infrastructure attacks doesn’t have deaths, as well as attacks with incendiary weapons. On the other side assasination (of course) and attacks to the police brings deaths. The results are very interesting, but we want to investigate what happens with the “explosion”. So what we are going to do is fix it as the “right-hand-side” and then as the “left-hand-side”. Let’s have a look of what are the rules that implies an “Explosion/Bombing”:

ExpBomItems = grep("atype=Bombing/Explosion", itemLabels(tData), value = TRUE)

rules <- apriori(data=tData, parameter=list(supp=0.1,conf = 0.5, minlen = 2), appearance=list(default="lhs",rhs=ExpBomItems), control = list (verbose=F)) 
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                    rhs                         support confidence     lift count
## [1] {wtype=Explosives,                                                                                  
##      ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1370981  0.9736300 1.770019 17021
## [2] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives,                                                                                  
##      trgt=1}                            => {atype=Bombing/Explosion} 0.1019637  0.9598150 1.744904 12659
## [3] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives,                                                                                  
##      gname=Unknown}                     => {atype=Bombing/Explosion} 0.1366551  0.9589645 1.743357 16966
## [4] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives}                  => {atype=Bombing/Explosion} 0.1951318  0.9539673 1.734273 24226
## [5] {wtype=Explosives,                                                                                  
##      trgt=0}                            => {atype=Bombing/Explosion} 0.3572153  0.9524106 1.731443 44349
## [6] {wtype=Explosives,                                                                                  
##      trgt=1,                                                                                            
##      gname=Unknown}                     => {atype=Bombing/Explosion} 0.1075698  0.9503985 1.727785 13355

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                                    rhs                         support confidence     lift count
## [1] {wtype=Explosives,                                                                                  
##      ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1370981  0.9736300 1.770019 17021
## [2] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives,                                                                                  
##      trgt=1}                            => {atype=Bombing/Explosion} 0.1019637  0.9598150 1.744904 12659
## [3] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives,                                                                                  
##      gname=Unknown}                     => {atype=Bombing/Explosion} 0.1366551  0.9589645 1.743357 16966
## [4] {rgn=Middle East & North Africa,                                                                    
##      wtype=Explosives}                  => {atype=Bombing/Explosion} 0.1951318  0.9539673 1.734273 24226
## [5] {wtype=Explosives,                                                                                  
##      trgt=0}                            => {atype=Bombing/Explosion} 0.3572153  0.9524106 1.731443 44349
## [6] {wtype=Explosives,                                                                                  
##      trgt=1,                                                                                            
##      gname=Unknown}                     => {atype=Bombing/Explosion} 0.1075698  0.9503985 1.727785 13355

It seems that explosives weapon in Middle East and North Africa implies with a near-1 level of confidence Bombing/Explosion events. Let’s see what happens dropping the redundant rules.

#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)

## [1] 14

rules <- rules[-subsetRules] # remove subset rules.
rules_conf <- sort(rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                                    rhs                         support confidence      lift count
## [1] {wtype=Explosives}                  => {atype=Bombing/Explosion} 0.5426493  0.9439681 1.7160944 67371
## [2] {rgn=Middle East & North Africa}    => {atype=Bombing/Explosion} 0.1962030  0.6761130 1.2291452 24359
## [3] {trgt=0}                            => {atype=Bombing/Explosion} 0.3622817  0.6572272 1.1948115 44978
## [4] {gname=Unknown}                     => {atype=Bombing/Explosion} 0.3009295  0.6432568 1.1694140 37361
## [5] {ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1382257  0.5371878 0.9765849 17161
## [6] {rgn=South Asia}                    => {atype=Bombing/Explosion} 0.1446292  0.5148526 0.9359805 17956

The results are a bit more clear since it’s evident that events in Middle East and North Africa are done by group whose name is unknown.

Let’s see now what implies a Bombing/Explosion attack:

rules <- apriori(data=tData, parameter=list(supp=0.1,conf = 0.5, minlen = 2), appearance=list(default="rhs",lhs=ExpBomItems), control = list (verbose=F)) 
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules

##     lhs                          rhs                support   confidence
## [1] {atype=Bombing/Explosion} => {wtype=Explosives} 0.5426493 0.9865138 
## [2] {atype=Bombing/Explosion} => {trgt=0}           0.3622817 0.6586130 
## [3] {atype=Bombing/Explosion} => {gname=Unknown}    0.3009295 0.5470773 
##     lift     count
## [1] 1.716094 67371
## [2] 1.194812 44978
## [3] 1.169414 37361

Strangely it happens that with a 65% of confidence it seems that Bombings doesn’t imply deaths.

rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules

##     lhs                          rhs                support   confidence
## [1] {atype=Bombing/Explosion} => {wtype=Explosives} 0.5426493 0.9865138 
## [2] {atype=Bombing/Explosion} => {trgt=0}           0.3622817 0.6586130 
## [3] {atype=Bombing/Explosion} => {gname=Unknown}    0.3009295 0.5470773 
##     lift     count
## [1] 1.716094 67371
## [2] 1.194812 44978
## [3] 1.169414 37361

Final observations and conclusions

After the study, some assertions are made about the results obtained and future recommendations to address the analysis of data on terrorism.

Faced with the results of the different models, a good prediction of the dependent variables is highlighted, however, information that is so useful as to help the security and population organizations is not extracted. Therefore, it is recommended to link the existing data with other political information data of countries and news that allow to examine the tension associated with terrorism and the possibility of attack.
On the other hand, the clustering model yields interesting results on the types of attack and characteristics of the various groups. Against this, it is recommended a segmentation by different decades that allows observing the evolution of terrorism over time and some current characteristics about the attacks and terrorist groups that operate today.
Finally, the value obtained in the analysis of association rules is highlighted. With these techniques, a lot of information obtained is observed and it is not easy to see, due to the large amount of data present. However, this analysis was done at the end of the work and after the predictive models. It is recommended to do this stage during the exploratory analysis and before making a predictive model, since the association rules provide an adequate guide for the preprocessing of data and to focus the models of segmentation and prediction that can be done about terrorism.

Reporte_Hito_III

Luca Aiello, Felipe Jorquera, Alejandro Soto

28 junio 2019